Context

Buying and selling used smartphones used to be confined to a handful of online marketplaces. But the used and refurbished phone market has grown considerably over the past decade, and a new IDC (International Data Corporation) forecast predicts that the used phone market will be worth $52.7bn by 2023, with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used smartphones, which offer considerable savings compared with new models.

Refurbished and used devices continue to provide cost-effective alternatives for both consumers and businesses looking to save money when purchasing a smartphone. The used smartphone market offers plenty of other benefits as well. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors and platforms, such as Verizon and Amazon, offer attractive deals on refurbished smartphones. Maximizing the longevity of mobile phones through second-hand trade also reduces their environmental impact, supporting recycling and reducing waste. The impact of the COVID-19 outbreak may further boost the cheaper refurbished smartphone segment, as consumers cut back on discretionary spending and buy phones only for immediate needs.

Objective

The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished smartphones. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone and identify factors that significantly influence it.

Data Description

The data contains the different attributes of used/refurbished phones. The detailed data dictionary is given below.

Data Dictionary

Importing necessary libraries and data

Check the shape and info
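A minimal sketch of this step, assuming the data has been loaded into a DataFrame named df (a tiny synthetic frame stands in for the actual ReCell file, which isn't shown here):

```python
import pandas as pd

# Synthetic stand-in for the ReCell data; in the notebook, df would come
# from pd.read_csv on the provided file.
df = pd.DataFrame(
    {
        "brand_name": ["Apple", "Samsung", "Honor"],
        "new_price": [699.0, 499.0, 199.0],
        "used_price": [350.0, 240.0, 90.0],
    }
)

print(df.shape)  # (number of rows, number of columns)
df.info()        # column dtypes and non-null counts
```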

Observations

Observations

Exploratory Data Analysis (EDA)

Explore Numerical Variables
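Alongside histograms and boxplots, the numerical columns can be summarized programmatically. A sketch with right-skewed synthetic stand-ins for columns like new_price, used_price, and days_used (names assumed from the data dictionary):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Right-skewed synthetic stand-ins for a few numerical columns.
df = pd.DataFrame(
    {
        "new_price": rng.lognormal(5.5, 0.6, 500),
        "used_price": rng.lognormal(4.8, 0.6, 500),
        "days_used": rng.integers(30, 1200, 500).astype(float),
    }
)

for col in df.select_dtypes(include=np.number).columns:
    print(f"{col}: skew={df[col].skew():.2f}, "
          f"median={df[col].median():.1f}, mean={df[col].mean():.1f}")
```

A mean well above the median together with a high skew value indicates a right-skewed distribution, the usual pattern for price columns.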

Observations

Observations

Observations

Observations

Observations

Observations

Observations

Observations

Observations

Observations

Observations

Explore Categorical Variables
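For categorical columns, relative frequencies are the usual numeric complement to count plots. A sketch (column names and values are illustrative):

```python
import pandas as pd

# Synthetic stand-ins for two categorical columns.
df = pd.DataFrame({
    "os": ["Android"] * 7 + ["iOS"] * 2 + ["Others"] * 1,
    "4g": ["yes"] * 6 + ["no"] * 4,
})

for col in ["os", "4g"]:
    print(df[col].value_counts(normalize=True).round(2), "\n")
```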

Observations

Group Brand Names
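One common way to group brand names is to collapse rare brands into an "Others" bucket so the dummy-encoded design matrix stays manageable. A sketch (the threshold and the new column name are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"brand_name": ["Samsung"] * 5 + ["Apple"] * 4 + ["Infinix", "Micromax"]}
)

counts = df["brand_name"].value_counts()
rare = counts[counts < 3].index  # the threshold of 3 is illustrative
df["brand_grouped"] = df["brand_name"].where(
    ~df["brand_name"].isin(rare), "Others"
)
print(df["brand_grouped"].value_counts())
```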

Observations

Observations

Observations

Data Preprocessing

There are no duplicate rows

Missing Value Treatment
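A sketch of one common treatment: count the missing values, then impute within each brand group and fall back to the overall median (column names assumed):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "brand_name": ["Apple", "Apple", "Samsung", "Samsung"],
    "ram": [4.0, np.nan, 6.0, np.nan],
})

print(df.isnull().sum())  # missing values per column

# Impute within each brand, then fall back to the overall median for any
# brand whose values are all missing.
df["ram"] = df.groupby("brand_name")["ram"].transform(lambda s: s.fillna(s.median()))
df["ram"] = df["ram"].fillna(df["ram"].median())
print(df)
```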

Observations

Observation

Bivariate Scatterplots
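The scatterplots' numeric companion is the correlation of each numerical column with used_price. A sketch on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
new_price = rng.lognormal(5.5, 0.5, 300)
df = pd.DataFrame({
    "new_price": new_price,
    # used_price is built to correlate strongly with new_price.
    "used_price": 0.5 * new_price + rng.normal(0, 10, 300),
    "days_used": rng.integers(30, 1200, 300).astype(float),
})

corr = df.corr(numeric_only=True)["used_price"].drop("used_price")
print(corr.sort_values(ascending=False).round(2))
```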

Observations

Let's look at graphs of the variables that have moderate to high correlation with used_price

Observations

Observations

Observations

Observations

Observations

Multivariate Relationships

Observations

Answer questions

To thoroughly analyze the data, address additional questions posed in the Jupyter notebook template provided.

Observations

Observations

Observations

Observations

Data Preprocessing

Outlier detection

Look at outliers in every numerical column
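A sketch of IQR-based outlier counting for each numerical column; the 1.5×IQR fences match the whiskers of a standard boxplot:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# 200 well-behaved values plus two planted outliers.
df = pd.DataFrame({"used_price": np.append(rng.normal(100, 15, 200), [500.0, 650.0])})

for col in df.select_dtypes(include=np.number).columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_out = int(((df[col] < lower) | (df[col] > upper)).sum())
    print(f"{col}: {n_out} outliers outside [{lower:.1f}, {upper:.1f}]")
```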

Observations

Linear Modeling without Outlier Treatment

Preparing Data for Modeling
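Preparation typically means dummy-encoding the categorical columns and splitting into train and test sets. A sketch (the 70:30 split and column names are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "brand_name": rng.choice(["Apple", "Samsung", "Others"], 100),
    "new_price": rng.lognormal(5.5, 0.5, 100),
    "used_price": rng.lognormal(4.8, 0.5, 100),
})

# drop_first avoids the dummy-variable trap for linear models.
X = pd.get_dummies(df.drop(columns="used_price"), drop_first=True, dtype=float)
y = df["used_price"]
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)
print(x_train.shape, x_test.shape)
```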

Check coefficients and intercept of the model

Check the performance of the model using different metrics
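The two checks above — coefficients/intercept and performance metrics — can be sketched as follows (synthetic data; the notebook would use x_train and y_train instead):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 10.0 + rng.normal(0, 0.5, 200)

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)

pred = model.predict(X)
rmse = np.sqrt(mean_squared_error(y, pred))
mae = mean_absolute_error(y, pred)
mape = np.mean(np.abs((y - pred) / y)) * 100  # assumes y is never 0
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}  R^2={r2_score(y, pred):.3f}  MAPE={mape:.2f}%")
```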

Observations

Outlier Treatment
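A common treatment is to cap values at the Tukey fences rather than drop rows, so no observations are lost. A sketch:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
df = pd.DataFrame({"used_price": np.append(rng.normal(100, 15, 200), [600.0])})

def clip_iqr(s: pd.Series) -> pd.Series:
    """Cap values at the Tukey fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return s.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

df["used_price"] = clip_iqr(df["used_price"])
print(round(df["used_price"].max(), 1))
```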

Check coefficients and intercept of the model

Check the performance of the model using different metrics

Observations

Linear Regression using statsmodels

Notes

Checking Linear Regression Assumptions

In order to make statistical inferences from a linear regression model, it is important to ensure that the assumptions of linear regression are satisfied. We will check:

  1. No multicollinearity
  2. Linearity of variables
  3. Independence of error terms
  4. Normality of error terms
  5. No heteroscedasticity

Test for Multicollinearity

The above predictors have no multicollinearity and the assumption is satisfied.

$p$-value

The above process can also be done efficiently using a loop.

Now no feature has a $p$-value $\geq$ 0.05, so we'll consider the features in x_train4 as the final ones and olsmod1 as the final model.

Observations

Now we will check the rest of the assumptions on olsmod1.

  1. Linearity of variables
  2. Independence of error terms
  3. Normality of error terms
  4. No heteroscedasticity

Test for Linearity and Independence

Why the test?

How to check linearity and independence?

How to fix if this assumption is not followed?

Test for Normality

Why the test?

How to check normality?

How to fix if this assumption is not followed?

Test for Homoscedasticity

Why the test?

How to check for homoscedasticity?

How to fix if this assumption is not followed?

Note: As the number of records is large, we will take 25 random records for representation purposes only

Let's compare the initial model created with sklearn and the final statsmodels model

Log Transformation

new_price and used_price are very skewed and will likely behave better on the log scale. This may work better than outlier treatment.

All three transformations have helped, but the sqrt is not quite strong enough and the result is still a bit skewed, so I prefer the log or arcsinh. The log and arcsinh look similar, so the difference between them is more about interpretation. It will likely be easier to explain the log of a number than the arcsinh, since arcsinh is a less familiar transformation, so choose log(new_price).

Again, all three transformations have helped, but the sqrt is not quite strong enough and the result is still a bit skewed, so I prefer the log or arcsinh. As before, the log and arcsinh look similar and the log is easier to explain, so choose log(used_price).
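This comparison can be reproduced numerically by checking the skew of each candidate transformation (a synthetic right-skewed series stands in for the price columns):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
s = pd.Series(rng.lognormal(5.0, 0.8, 1000), name="new_price")  # right-skewed stand-in

for name, t in [("raw", s), ("sqrt", np.sqrt(s)),
                ("log", np.log(s)), ("arcsinh", np.arcsinh(s))]:
    print(f"{name:8s} skew = {t.skew():.2f}")
```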

Check coefficients and intercept of the model

Check the performance of the model

Observations

Linear Regression using statsmodels with log transformed data

The above predictors have no multicollinearity and the assumption is satisfied.

$p$-value

The above process can also be done efficiently using a loop.

Now no feature has a $p$-value > 0.05, so we'll consider the features in x_train_log3 as the final ones and olsmod_log2 as the final model.

Observations

Check assumptions on olsmod_log2

The above predictors have no multicollinearity and the assumption is satisfied.

We see no pattern in this plot. Hence, the assumptions of linearity and independence are satisfied.

Final Model Summary

Let's recreate the final statsmodels model and print its summary to gain insights.

Conclusions

  1. A 1% increase in new price is associated with a ~0.9% increase in used price.
  2. Each one-year increase in release year (i.e., a newer phone) is associated with a ~0.6% higher used price.
  3. Each additional day of use is associated with a ~0.1% decrease in used price.
  4. Phones with 4g cost ~1% less than those without 4g.
  5. Phones with 5g cost ~2.4% more than those without 5g.
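Since the final model predicts log(used_price), a coefficient b on a non-logged predictor converts to a percentage effect via (exp(b) − 1) × 100. A sketch with illustrative coefficient values (the real ones come from the fitted model's summary):

```python
import numpy as np

# Illustrative coefficients only, not the fitted values.
for name, b in [("release_year", 0.006), ("days_used", -0.001), ("5g_yes", 0.024)]:
    pct = (np.exp(b) - 1) * 100
    print(f"{name}: {pct:+.2f}% change in used price per unit increase")
```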

Actionable Insights and Recommendations

The factors that influence used price the most are the release year, number of days used, new price, and availability of 4g or 5g.